Quality Control and Normalization of Peptide Data with the pmartR Package

Kelly Stratton, Lisa Bramer

2018-12-06

This vignette describes the pmartR package functionality for quality control (QC) processing of mass spectrometry (MS) pan-omics data, in particular proteomics data at the peptide level. This includes data transformation, specification of groups that are to be compared against each other, filtering of features and/or samples, data normalization, and data summarization (correlation, principal components analysis). It is based on example data in the pmartRdata package.

Additional datasets available in the pmartRdata package include the following data types: protein level data, lipidomic data, and metabolomic data. A separate vignette describes the statistical analysis capabilities of pmartR.

Data

Below we load in 3 data.frames from the pmartRdata package:

Here, \(p\) is 17407 and \(n\) is 12. The pep_fdata data.frame contains the sample identifier and the condition designation for each sample, either ‘Infection’ (samples that were infected with some pathogen) or ‘Mock’ (uninfected samples used as a control). The pep_emeta data.frame contains the peptide identifier column (‘Mass_Tag_ID’), the protein identification name to which each mass tag ID maps (‘Protein’), a reference identifier (‘Ref_ID’), and the peptide sequence (Peptide_Sequence). Note that the mass tag IDs are unique (\(n_{peptide} =\) 17407) while the protein identifiers are not (\(n_{protein} =\) 2774) since multiple peptides often map to the same protein.

data("pep_edata")
dim(pep_edata)
## [1] 17407    13
pep_edata[1:6,1:5]
##    Mass_Tag_ID Infection1 Infection2 Infection3 Infection4
## 4         1024         NA         NA         NA         NA
## 6         1047   17953839   20071472   20745779   18206556
## 9         1055  109536335  115459820  106127139   74522014
## 14        1104 1752288782 1796561709 1703186182 2438218572
## 16        1110    2571804    4269824    4852871    2630414
## 23        1214  110239193   82436688  100447189  102006001
data("pep_fdata")
dim(pep_fdata)
## [1] 12  2
head(pep_fdata)
##     SampleID Condition
## 1 Infection1 Infection
## 2 Infection2 Infection
## 3 Infection3 Infection
## 4 Infection4 Infection
## 5 Infection5 Infection
## 6 Infection6 Infection
data("pep_emeta")
dim(pep_emeta)
## [1] 17407     4
head(pep_emeta)
##    Mass_Tag_ID    Protein Ref_ID         Peptide_Sequence
## 6         1024 ACBP_HUMAN    324 K.QATVGDINTERPGMLDFTGK.A
## 15        1047 ALBU_HUMAN    580       T.ECCHGDLLECADDR.A
## 20        1055 RL40_HUMAN   9898     K.TITLEVEPSDTIENVK.A
## 35        1104 ALBU_HUMAN    580      K.KVPQVSTPTLVEVSR.N
## 39        1110  CYC_HUMAN   2918          K.TGPNLHGLFGR.K
## 61        1214  G3P_HUMAN   4480             K.VGVNGFGR.I

Next we create pepData object, which contains the 3 components loaded above. During object creation, we specify the names of the identifier columns in the e_data, f_data, and e_meta components of the data (edata_cname = "Mass_Tag_ID", emeta_cname = "Protein", fdata_cname = "SampleID", respectively), as well as the scale of the data (data_scale = "abundance"). We can use the summary function to get basic summary information for our pepData object.

mypepData <- as.pepData(e_data = pep_edata, f_data = pep_fdata, e_meta = pep_emeta, edata_cname = "Mass_Tag_ID", emeta_cname = "Protein", fdata_cname = "SampleID", data_scale = "abundance")
class(mypepData)
## [1] "pepData"
summary(mypepData)
##                                     
## Class                        pepData
## Unique SampleIDs (f_data)         12
## Unique Mass_Tag_IDs (e_data)   17407
## Unique Proteins (e_meta)        2774
## Missing Observations           82833
## Proportion Missing             0.397

Alternatively, the pmartRdata package contains the corresponding pepData object, which can be loaded as follows.

data("pep_object")
class(pep_object)
## [1] "pepData"
rm(pep_object)

We can use the plot() function to display box plots for each sample in the pepData object.

plot(mypepData)

Additional Data

Also included in the pmartR package are protein, metabolite, and lipid data, in analagous formats to the peptide data described above.

Finally, we include .csv versions of the metabolite data (e_data and f_data), where the sample names have been modified to begin with numbers and contain dashes. These are included to illustrate use of the “check.names” argument to the (and ) functions. If we read in the e_data file without specifying , R automatically puts an “X” at the beginning of the sample names so they do not begin with a number and replaces the dashes with periods. This causes the sample names in e_data to not match the sample names in f_data, and will prevent the creation of a metabData object.

edata <- read.csv(system.file("extdata", "metab_edata_sample_names.csv", package="pmartRdata"), header=TRUE)
names(edata)
##  [1] "Metabolite"   "X1.Mock"      "X2.Mock"      "X3.Mock"     
##  [5] "X1.Infection" "X2.Infection" "X3.Infection" "X4.Infection"
##  [9] "X5.Infection" "X6.Infection" "X7.Infection" "X8.Infection"
## [13] "X9.Infection"
fdata <- read.csv(system.file("extdata", "metab_fdata_sample_names.csv", package="pmartRdata"), header=TRUE)

# Do the sample names match each other? #
all(names(edata)[-1] == fdata$SampleID)
## [1] FALSE

To avoid this, without having to modify the sample names in f_data, we can specify :

edata <- read.csv(system.file("extdata", "metab_edata_sample_names.csv", package="pmartRdata"), header=TRUE, check.names=FALSE)
names(edata)
##  [1] "Metabolite"  "1-Mock"      "2-Mock"      "3-Mock"      "1-Infection"
##  [6] "2-Infection" "3-Infection" "4-Infection" "5-Infection" "6-Infection"
## [11] "7-Infection" "8-Infection" "9-Infection"
# Do the sample names match each other? #
all(names(edata)[-1] == fdata$SampleID)
## [1] TRUE

Workflow

Once the pepData object is created, a typical QC workflow follows the figure below.

Figure 1. Quality control and processing workflow in pmartR package.

Figure 1. Quality control and processing workflow in pmartR package.

We will walk through the steps in this QC workflow using mypepData.

Format Data

Sometimes omics abundance data contains 0s when the particular biomolecule was not detected for a given sample. It is our practice to replace any 0s with NAs as opposed to imputing the missing values. We strongly prefer NAs to imputation because we cannot know whether the biomolecule is not present in the sample, simply was below the limit of detection, or was present but missing from our data at random. The function edata_replace() can be used to replace values in the pepData object.

mypepData <- edata_replace(mypepData, x = 0, y = NA)
## 0 instances of 0 have been replaced with NA

In this dataset, there were no 0s so nothing was replaced.

We also highly recommend log transforming the data prior to analysis. pmartR supports log2, log10, and natural logarithm transformations. The edata_transform() function provides this capability. We can also use edata_trasnform() to transform the data back to the abundance scale if needed. Note that the scale of the data is stored and automatically updated in the data_info$data_scale attribute of the pepData object. Below we log10 transform the data, then return to the abundance scale, and finally settle on the log2 scale.

mypepData <- edata_transform(mypepData, data_scale = "log10")
attributes(mypepData)$data_info$data_scale
## [1] "log10"
mypepData <- edata_transform(mypepData, data_scale = "abundance")
attributes(mypepData)$data_info$data_scale
## [1] "abundance"
mypepData <- edata_transform(mypepData, data_scale = "log2")
attributes(mypepData)$data_info$data_scale
## [1] "log2"
plot(mypepData)

Finally, we are preparing this data for statistical analysis where we will compare the samples belonging to one group to the samples belonging to another, and so we must specify the group membership of the samples. We do this using the group_designation() function, which modifies our pepData object and returns an updated version of it. Up to two main effects and up to two covariates may be specified, with one main effect being required at minimum. For the example data, we only have one option - specify the main effect to be “Condition” - since we do not have any additional information about the samples. Certain functions we will use below require that groups have been designated via the group_designation() function.

mypepData <- group_designation(mypepData, main_effects = "Condition", covariates = NULL)

The group_designation() function creates an attribute of the dataset as follows:

attributes(mypepData)$group_DF
##      SampleID     Group
## 1  Infection1 Infection
## 2  Infection2 Infection
## 3  Infection3 Infection
## 4  Infection4 Infection
## 5  Infection5 Infection
## 6  Infection6 Infection
## 7  Infection7 Infection
## 8  Infection8 Infection
## 9  Infection9 Infection
## 10      Mock1      Mock
## 11      Mock2      Mock
## 12      Mock3      Mock
plot(mypepData, color_by = "Condition", bw_theme=TRUE)

Filter Peptides

It is often good practice to filter out peptides that do not meet certain criteria, and we offer 5 different filters. Each of the filtering functions calculates metric(s) that can be used to filter out peptides and returns an S3 object. Using the summary() function produces a summary of the metric(s) and using the plot() function produces a graph. Filters that require a user-specified threshold in order to actually filter out peptides have corresponding summary and plot methods that take optional argument(s) to specify that threshold. Once one of the 5 peptide level filter functions has been called, the results of that function can be used in conjunction with the applyFilt() function to actually filter out peptides based on the metric(s) and user-specified threshold(s) and create a new, filtered pepData object.

Proteomics Filter

The proteomics filter is applicable only to peptide level data that contains the e_meta component, as it counts the number of peptides that map to each protein and/or the number of proteins to which each individual peptide maps. It returns a list of two character vectors, the first, peptides_filt, giving degenerate peptide names. The second, proteins_filt, gives the names of proteins which no longer have peptides mapping to them in the dataset. This filter does not require a user-specified threshold.

myfilter <- proteomics_filter(mypepData)
summary(myfilter)
## 
## Summary of Proteomics Filter 
## 
##              Obs. Per Peptide Obs. Per Protein
## Min.                        1                1
## 1st Qu.                     1                1
## Median                      1                3
## Mean                        1 6.27505407354001
## 3rd Qu.                     1                7
## Max.                        1              333
##                                               
## Filtered                    0                0
## Not Filtered            17407             2774
plot(myfilter, bw_theme=TRUE)

With the current dataset, the majority of proteins have one peptide mapping to them (graph on the left) and all peptides map to a single protein (e.g. no degenerate peptides; graph on the right).

Molecule Filter

The molecule filter allows the user to remove from the dataset any biomolecule not seen in at least min_num of the samples. The user may specify a threshold of the minimum number of times each biomolecule must be observed across all samples; the default value is 2.

myfilter <- molecule_filter(mypepData)
plot(myfilter, bw_them = TRUE)

summary(myfilter, min_num = 3)
## 
## Summary of Molecule Filter 
## ----------------------------------                                  
## Minimum Number:                  3
## Filtered:                     3342
## Not Filtered:                14065
##                                   
## Molecule Observation Counts:      
## 
##  num_observations frequency_counts
##                 1             1814
##                 2             3342
##                 3             4911
##                 4             5934
##                 5             6737
##                 6             7530
##                 7             8389
##                 8             9279
##                 9            10383
##                10            11527
##                11            12987
##                12            17407

Setting the threshold to 3, we would filter 3,342 peptides out of the dataset. If we’d like to make the filter less stringent, we could use a threshold of 2 and only filter 1,872 peptides.

summary(myfilter, min_num = 2)
## 
## Summary of Molecule Filter 
## ----------------------------------                                  
## Minimum Number:                  2
## Filtered:                     1814
## Not Filtered:                15593
##                                   
## Molecule Observation Counts:      
## 
##  num_observations frequency_counts
##                 1             1814
##                 2             3342
##                 3             4911
##                 4             5934
##                 5             6737
##                 6             7530
##                 7             8389
##                 8             9279
##                 9            10383
##                10            11527
##                11            12987
##                12            17407
#plot(myfilter, min_num = 2)

mypepData <- applyFilt(filter_object = myfilter, mypepData, min_num = 2)
summary(mypepData)
##                                     
## Class                        pepData
## Unique SampleIDs (f_data)         12
## Unique Mass_Tag_IDs (e_data)   15593
## Unique Proteins (e_meta)        2511
## Missing Observations           62879
## Proportion Missing             0.336
## Samples per group: Infection       9
## Samples per group: Mock            3

Coefficient of Variation Filter

The coefficient of variation (CV) filter calculates the pooled CV values as in (Ahmed 1995).

The user can then specify a CV threshold, above which peptides are removed.

myfilter <- cv_filter(mypepData)
summary(myfilter, cv_threshold = 150)
## 
## Summary of Coefficient of Variation (CV) Filter
## ----------------------
## CVs:
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.04778  24.85999  32.28337  34.83979  42.00244 168.69275 
## 
## Total NAs: 359
## Total Non-NAs: 15234 
## 
## Number Filtered Biomolecules: 2
plot(myfilter, cv_threshold = 150, title_size = 15)

mypepData <- applyFilt(filter_object = myfilter, mypepData, cv_threshold = 150)
summary(mypepData)
##                                     
## Class                        pepData
## Unique SampleIDs (f_data)         12
## Unique Mass_Tag_IDs (e_data)   15591
## Unique Proteins (e_meta)        2511
## Missing Observations           62871
## Proportion Missing             0.336
## Samples per group: Infection       9
## Samples per group: Mock            3

Custom Filter

Sometimes it is known a priori that certain peptides or samples should be filtered out of the dataset prior to statistical analysis. Perhaps there are known contaminant proteins, and so peptides mapping to them should be removed, or perhaps something went wrong during sample preparation for a particular sample. On the other hand, it may be preferred to specify peptides or samples to keep (removing those not explicitly specified), and this can also be accomplished. Keep in mind that both ‘remove’ and ‘keep’ arguments cannot be specified together; either ‘remove’ arguments only or ‘keep’ arguments only may be specified in a single call to custom_filter().

Here, we demonstrate the removal of the peptide with Mass Tag ID 1047 and sample Infection 1 as an example.

myfilter <- custom_filter(mypepData, e_data_remove = 1047, e_meta_remove = NULL, f_data_remove = "Infection1")
summary(myfilter)
## 
## Summary of Custom Filter
## 
##                       Filtered Remaining Total
## SampleIDs (f_data)           1        11    12
## Mass_Tag_IDs (e_data)        1     15590 15591
## Proteins (e_meta)            0      2511  2511
mypepData_temp <- applyFilt(filter_object = myfilter, mypepData)
summary(mypepData_temp)
##                                     
## Class                        pepData
## Unique SampleIDs (f_data)         11
## Unique Mass_Tag_IDs (e_data)   15590
## Unique Proteins (e_meta)        2511
## Missing Observations           58431
## Proportion Missing             0.341
## Samples per group: Infection       8
## Samples per group: Mock            3
rm(mypepData_temp)

We also demonstrate how to use this filter with the ‘keep’ option by keeping the samples Infection 1, Infection 2, Infection 3, Infection 4 and Infection 5.

myfilter2<- custom_filter(mypepData, e_data_keep = NULL, e_meta_keep = NULL, f_data_keep = c("Infection1", "Infection2", "Infection3", "Infection4", "Infection5"))
summary(myfilter2)
## 
## Summary of Custom Filter
## 
##                        Kept Discarded Total
## SampleIDs (f_data)        5         7    12
## Mass_Tag_IDs (e_data) 15591         0 15591
## Proteins (e_meta)      2511         0  2511
mypepData_temp2<- applyFilt(filter_object = myfilter2, mypepData)
summary(mypepData_temp2)
##                                     
## Class                        pepData
## Unique SampleIDs (f_data)          5
## Unique Mass_Tag_IDs (e_data)   15591
## Unique Proteins (e_meta)        2511
## Missing Observations           25005
## Proportion Missing             0.321
## Samples per group: Infection       5
rm(mypepData_temp2)

Note that there is a summary() method for objects of type custom_filt, but no plot() method.

IMD-ANOVA Filter

The IMD-ANOVA filter removes peptides that do not have sufficient data for the statistical tests available in the pmartRstat package; these are ANOVA (quantitative test) and an independence of missing data (IMD) using a g-test (qualitative test). When using the summary(), plot(), and applyFilt() functions, you can specify just one filter (ANOVA or IMD) or both, depending on the tests you’d like to perform later. Using this filter speeds up the process of performing the statistical tests.

myfilter <- imdanova_filter(mypepData)

Here we consider what the filter would look like for both ANOVA and IMD.

summary(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)
## 
## Summary of IMD Filter
##                           
## Total Observations:  15591
## Filtered:             2052
## Not Filtered:        13539
#plot(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)

Here we consider what the filter would look like for just ANOVA.

summary(myfilter, min_nonmiss_anova = 2)
## 
## Summary of IMD Filter
##                           
## Total Observations:  15591
## Filtered:             5511
## Not Filtered:        10080
#plot(myfilter, min_nonmiss_anova = 2)

Here we consider what the filter would look like for just IMD.

summary(myfilter, min_nonmiss_gtest = 3)
## 
## Summary of IMD Filter
##                           
## Total Observations:  15591
## Filtered:             2280
## Not Filtered:        13311
#plot(myfilter, min_nonmiss_gtest = 3)

Now we apply the filter for both ANOVA and IMD.

mypepData <- applyFilt(filter_object = myfilter, mypepData, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)

summary(mypepData)
##                                     
## Class                        pepData
## Unique SampleIDs (f_data)         12
## Unique Mass_Tag_IDs (e_data)   13539
## Unique Proteins (e_meta)        2253
## Missing Observations           42875
## Proportion Missing             0.264
## Samples per group: Infection       9
## Samples per group: Mock            3

Filter Samples

To identify any samples that are potential outliers or anomalies (due to sample quality, preparation, or processing circumstances), we use a robust Mahalanobis distance (rMd) (Matzke et al. 2011) score based on 2-5 metrics. The possible metrics are:

The rMd score can be mapped to a p-value, and a p-value threshold used to identify potentially outlying samples. In general, for proteomics data we recommend using correlation, proportion missing, MAD, and skew. A plot of the rMd values for each sample is generated, and specifying a value for ‘pvalue_threshold’ results in a horizontal line on the plot, with samples above the line slated for filtering at the given threshold.

myfilter <- rmd_filter(mypepData, metrics = c("Correlation", "Proportion_Missing", "MAD", "Skewness"))
plot(myfilter, bw_theme=TRUE)

summary(myfilter, pvalue_threshold = 0.001)
## 
## Summary of RMD Filter
## ----------------------
## P-values:
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000001 0.2686711 0.4152817 0.5050117 0.8306939 0.9707249 
## 
## Metrics Used: MAD, Skewness, Corr, Proportion_Missing 
## 
## Filtered Samples: Infection5, Infection8
plot(myfilter, pvalue_threshold = 0.001, bw_theme=TRUE)

Using the output from the summary() function, we can explore the outliers identified to see what metrics are driving their outlier-ness. Box plots for each metric are graphed, with the specified sample marked with an ‘X’.

plot(myfilter, sampleID = "Infection5", bw_theme=TRUE)

plot(myfilter, sampleID = "Infection8", bw_theme=TRUE)

We can see that Infection5 has the highest MAD, low skewness, and almost the lowest correlation, but the lowest proportion missing compared to the other samples. Infection 8 has MAD close to the median MAD value, the highest skew, the lowest correlation, and the highest proportion of missing data. We can use this information to determine whether to remove either or both of these samples. Sometimes it is also useful to look at additional data summaries to inform outlier removal, such as a principal components plot or correlation heat map (below).

Normalize Data

The next step in a typical workflow is to normalize the data.

Normalization approaches consist of a subset method and a normalization function (Åstrand 2003),(Gautier et al. 2004). Available subset methods include:

Available normalization functions include:

There are two ways to go about normalizing data in the pmartR package. If you know the normalization approach you’d like to use, you can directly specify it in the normalize_data() function. Or, if you want to see how compatible different normalization approaches are with your dataset, you can use the spans_procedure() function.

We can go straight to the normalization, if we know what approach we want to use.

# Normalize using all peptides and median centering #
norm_object <- normalize_global(omicsData = mypepData, subset_fn = "all", norm_fn = "median", apply_norm = FALSE, backtransform = TRUE)
norm_data <- normalize_global(omicsData = mypepData, subset_fn = "all", norm_fn = "median", apply_norm = TRUE, backtransform = TRUE)
plot(norm_data, title_plot="Normalized Data: Median Centering Using all Peptides", title_size=12)

# Normalize using RIP 0.2 and median centering #
norm_object <- normalize_global(omicsData = mypepData, subset_fn = "rip", params=list(rip=0.2), norm_fn = "median", apply_norm = FALSE, backtransform = TRUE)
norm_data <- normalize_global(omicsData = mypepData, subset_fn = "rip", params=list(rip=0.2), norm_fn = "median", apply_norm = TRUE, backtransform = TRUE)
plot(norm_data, title_plot="Normalized Data: Median Centering Using RIP 0.2 Peptides", title_size=12)

Below we use SPANS to select a normalization approach and then normalize the data.

# returns a dataframe arranged by descending SPANS score
spans_result <- spans_procedure(mypepData)
## [1] "Finished creating background distribution, moving to method candidate selection"
## [1] "Finished method candidate selection, proceeding to score selected methods."
## [1] "Finished scoring selected methods"
summary(spans_result)
## 
## Summary of spans procedure
## 
## Highest ranked method(s)
##   subset_method normalization_method SPANS_score parameters
## 1           rip               median           1        0.1
## 2           rip               median           1       0.15
## 3           rip               median           1        0.2
## 4           rip               median           1       0.25
## 5       ppp_rip               median           1    0.1;0.1
## 6       ppp_rip               median           1  0.25;0.15
##   mols_used_in_norm passed_selection rank
## 1              1554             TRUE    1
## 2              1369             TRUE    1
## 3              1158             TRUE    1
## 4               987             TRUE    1
## 5              3195             TRUE    1
## 6              3195             TRUE    1
## 
## Number of input methods:  68
## Number of methods scored:  30
## Average molecules used in normalization:  5277
plot(spans_result)
# a list of the parameters for any normalization procedure with the best SPANS score
best_params <- get_spans_params(spans_result)

# there are a few ties, all using ppp_rip, well select the method that uses median normalization and parameters ppp = 0.1 and rip = 0.1
subset_fn = best_params[[1]]$subset_fn
norm_fn = best_params[[1]]$norm_fn
params = best_params[[1]]$params

# we now pass the extracted subset function, normalization function, and parameters from SPANS to normalize_global()
norm_object <- normalize_global(omicsData = mypepData, subset_fn = subset_fn, norm_fn = norm_fn, params = params)

Summarize Data

The pmartR package contains various methods for data summarization and exploration that can be used as part of the QC process: numeric summaries and associated plots (via the edata_summary() function), sequential projection pursuit principal components analysis (sppPCA) for dimension reduction (via thedim_reduction() function), and a correlation heat map (via the cor_result() function)(Stacklies et al. 2007). As for missing data summarization there are also functions to collect and plot missing data information. Missing data collection and plotting functions include, missingval_result(), plot_missingval(), missingval_scatterplot() and missingval_heatmap().

Numeric Summaries

We can generate numeric summaries of our data by either sample or molecule. The edata_summary() function computes the mean, standard deviation, median, percent of observations for which a value was observed, the minimum value, and the maximum value.

edata_summary(omicsData = norm_data, by = "sample", groupvar = NULL)
## mean
##        sample             mean
## 1  Infection1 21.7971083294077
## 2  Infection2  21.731366523126
## 3  Infection3 21.8032232443703
## 4  Infection4 21.8774087539977
## 5  Infection5 21.4257258525994
## 6        ----             ----
## 8  Infection8 22.4709412800765
## 9  Infection9  22.179932218701
## 10      Mock1 22.5630135430607
## 11      Mock2 22.4045599582422
## 12      Mock3  22.350225943675
## 
## std_dev
##        sample               sd
## 1  Infection1 2.41455648093841
## 2  Infection2 2.31224701980782
## 3  Infection3 2.35043331396441
## 4  Infection4 2.44617068574648
## 5  Infection5  2.5535287332081
## 6        ----             ----
## 8  Infection8 2.26180598552078
## 9  Infection9 2.36212527529442
## 10      Mock1 2.28423295890622
## 11      Mock2 2.18183571765243
## 12      Mock3 2.28027627057377
## 
## median
##        sample           median
## 1  Infection1 21.7706905193056
## 2  Infection2 21.6715407709563
## 3  Infection3 21.7490538629397
## 4  Infection4 21.8551805308581
## 5  Infection5  21.465407352971
## 6        ----             ----
## 8  Infection8 22.4154539852953
## 9  Infection9  22.122666472151
## 10      Mock1 22.5353174031891
## 11      Mock2 22.3226552138985
## 12      Mock3 22.3085808857998
## 
## pct_obs
##        sample           pct_obs
## 1  Infection1 0.796513775020312
## 2  Infection2 0.790014033532757
## 3  Infection3 0.730334588965212
## 4  Infection4 0.691705443533496
## 5  Infection5 0.802348770219366
## 6        ----              ----
## 8  Infection8 0.529359627742078
## 9  Infection9 0.611123421227565
## 10      Mock1 0.789940172834035
## 11      Mock2 0.796661496417756
## 12      Mock3 0.799098899475589
## 
## minimum
##        sample              min
## 1  Infection1 12.1066097831752
## 2  Infection2 12.3753923747856
## 3  Infection3 13.4885281669697
## 4  Infection4 12.2536652266729
## 5  Infection5  11.026508295961
## 6        ----             ----
## 8  Infection8 14.3702269404971
## 9  Infection9 13.4319078573245
## 10      Mock1 13.0850982141284
## 11      Mock2 13.6650934960137
## 12      Mock3 11.6186174442457
## 
## maximum
##        sample              max
## 1  Infection1 30.9532147548287
## 2  Infection2 30.7628477282651
## 3  Infection3 30.8984015884978
## 4  Infection4 31.2420886871443
## 5  Infection5 30.9490717160277
## 6        ----             ----
## 8  Infection8 32.3710788536963
## 9  Infection9 32.1531258831469
## 10      Mock1  31.275411454618
## 11      Mock2 31.2032188575782
## 12      Mock3 31.1871443967351
#edata_summary(omicsData = norm_data, by = "molecule", groupvar = "Condition")
#edata_summary(omicsData = norm_data, by = "molecule", groupvar = NULL)

Probabilistic Principal Components Analysis

Probabilistic is a PCA algorithm that can be used in the presence of missing data. We use the dim_reduction() function in pmartR, specifying the dataset and the number of principal components to return (default value of 2 principal components). The summary() and plot() functions operate on the results of the dim_reduction() function.

Note that dim_reduction() requires that the group_designation() function has been run.

pca_results <- dim_reduction(norm_data, k = 2)

summary(pca_results) 
## Summary of 'dimRes' Object
##     R_squaredPC1     0.528PC2     0.341
plot(pca_results, bw_theme=TRUE)

Correlation Heatmap

Pearson correlation between samples is calculated based on pairwise complete biomolecules.

correlation_results <- cor_result(norm_data)

summary(correlation_results)
## Summary of Correlation Matrix (Upper Triangle)
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7804  0.8688  0.9143  0.9032  0.9496  0.9722
plot(correlation_results)

Missing Data

Patterns of missing data can be explored using the missingval_result() and plot_missingval() functions.

results <- missingval_result(mypepData)


plot(results, type = "bySample")

plot(results, type = "byMolecule")

In addition, the missingval_scatterplot() and missingval_heatmap() functions provide more views of the missing data.

missingval_scatterplot(mypepData, palette = "Set1")

missingval_heatmap(mypepData)

Protein Quantification Methods

Protein quantification can be done using the protein_quant() function, either with or without accounting for different isoforms of the proteins, also called ‘proteoforms’. Regardless of whether the user is accounting for protein isoforms, they must specify a method for rolling peptides up to proteins (one of “rollup”, “rollup”, “qrollup”, or “zrollup”) and a function to use for combining the peptide-level data (either “mean” or “median”). When the user does wish to account for protein isoforms, they have two options for doing so: Bayesian Proteoform Quantification (BP-Quant) (Webb-Robertson et al. 2014) or Protein Quantification via Peptide Quality-Control (PQPQ).

The protein_quant() function returns an object of class proData.

Rollup Methods

The rollup method takes either the mean or median of all peptides mapping to a given protein, and sets that value as the relative protein abundance.

In the rrollup method, all peptides that map to a single protein are scaled based on a chosen reference peptide, which is the peptide with most observations. Next the average or mean of the scaled peptides is set as the relative protein abundance. (Matzke et al. 2013)

The qrollup method starts with all peptides that map to a single protein. Next peptides are chosen according to an abundance cutoff value and the average or mean of the scaled peptides is set as the protein abundance.

In the zrollup method, scaling is done similarly to the z-score formula (except with medians instead of means). The scaling formula is applied to peptides that map to a single protein and then the mean or median of the scaled peptides is set as protein abundance (from DAnTE article). The rollup method is similar to rrollup method, except there is no scaling involved in these methods. Either the mean or median is applied to all peptides that map to a single protein to obtain protein abundance. (Polpitiya et al. 2008)

print(mypepData)

mypepdata_median<- protein_quant(pepData = mypepData, method = "rollup", combine_fn = "median")
print(mypepdata_median)
rm(mypepdata_median)

mypepdata_rrollup<- protein_quant(pepData = mypepData, method = "rrollup")
print(mypepdata_rrollup)
rm(mypepdata_rrollup)

References

Ahmed, SE. 1995. “A Pooling Methodology for Coefficient of Variation.” Sankhyā: The Indian Journal of Statistics, Series B. JSTOR, 57–75.

Åstrand, Magnus. 2003. “Contrast Normalization of Oligonucleotide Arrays.” Journal of Computational Biology 10 (1). Mary Ann Liebert, Inc.: 95–102.

Gautier, Laurent, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. 2004. “Affy—analysis of Affymetrix Genechip Data at the Probe Level.” Bioinformatics 20 (3). Oxford University Press: 307–15.

Matzke, Melissa M, Joseph N Brown, Marina A Gritsenko, Thomas O Metz, Joel G Pounds, Karin D Rodland, Anil K Shukla, et al. 2013. “A Comparative Analysis of Computational Approaches to Relative Protein Quantification Using Peptide Peak Intensities in Label-Free Lc-Ms Proteomics Experiments.” Proteomics 13 (3-4). Wiley Online Library: 493–503.

Matzke, Melissa M, Katrina M Waters, Thomas O Metz, Jon M Jacobs, Amy C Sims, Ralph S Baric, Joel G Pounds, and Bobbie-Jo M Webb-Robertson. 2011. “Improved Quality Control Processing of Peptide-Centric Lc-Ms Proteomics Data.” Bioinformatics 27 (20). Oxford University Press: 2866–72.

Polpitiya, Ashoka D, Wei-Jun Qian, Navdeep Jaitly, Vladislav A Petyuk, Joshua N Adkins, David G Camp, Gordon A Anderson, and Richard D Smith. 2008. “DAnTE: A Statistical Tool for Quantitative Analysis of-Omics Data.” Bioinformatics 24 (13). Oxford University Press: 1556–8.

Stacklies, Wolfram, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. “PcaMethods—a Bioconductor Package Providing Pca Methods for Incomplete Data.” Bioinformatics 23 (9). Oxford University Press: 1164–7.

Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Susmita Datta, Samuel H Payne, Jiyun Kang, Lisa M Bramer, Carrie D Nicora, et al. 2014. “Bayesian Proteoform Modeling Improves Protein Quantification of Global Proteomic Measurements.” Molecular & Cellular Proteomics 13 (12). ASBMB: 3639–46.